Understand unicode and UTF


2024-07-03 04:00 | Source: compiled from the web

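Unicode is a character set (字符集). It is also called 万国码 ("universal code"), a name that paints a vivid picture: every character of every language has its own unique number in it. It also reserves some numbers for compatibility with older character encodings.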
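To make the "unique number per character" idea concrete, here is a minimal Python sketch. The helper name `code_point` is made up for this illustration; `ord` is the built-in that returns a character's Unicode number.

```python
def code_point(ch: str) -> str:
    """Return the Unicode number of one character in U+XXXX notation."""
    # ord() gives the number as an integer; format it as uppercase hex.
    return f"U+{ord(ch):04X}"


# One Latin letter, one Chinese character, one emoji.
for ch in ["A", "中", "😀"]:
    print(ch, ord(ch), code_point(ch))
# A 65 U+0041
# 中 20013 U+4E2D
# 😀 128512 U+1F600
```

Note that nothing here says how the number is stored; that part is the job of an encoding such as UTF-8.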

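UTF-8 first converts a character's number into a binary sequence of one to four 8-bit bytes, which can then be stored on disk. In the other direction, binary data read from disk is converted back into character numbers. A Chinese character usually takes 3 bytes, while an English letter takes 1 byte, so UTF-8 is a variable-length encoding.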
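The variable length is easy to check in Python. This is only a small sketch; `utf8_bits` is a hypothetical helper written for this note, not a library function.

```python
def utf8_bits(ch: str) -> str:
    """Return the UTF-8 bytes of a character as space-separated 8-bit groups."""
    return " ".join(f"{b:08b}" for b in ch.encode("utf-8"))


for ch in ["A", "中", "😀"]:
    encoded = ch.encode("utf-8")
    # Print the character, how many bytes it needs, and the raw bits.
    print(ch, len(encoded), utf8_bits(ch))
# A 1 01000001
# 中 3 11100100 10111000 10101101
# 😀 4 11110000 10011111 10011000 10000000
```

The English letter needs 1 byte, the Chinese character 3 bytes, and the emoji 4 bytes.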

(UTF-16 is a bit different: it uses 2 or 4 bytes per character and is commonly used for in-memory strings. To explore later.)
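As a first step in that exploration, the sketch below compares byte counts in the two encodings. It assumes the `utf-16-le` codec (little-endian, no byte order mark) so that the counts are not inflated by a BOM; `encoded_len` is a name invented for this example.

```python
def encoded_len(ch: str, encoding: str) -> int:
    """Return how many bytes a character needs in the given encoding."""
    return len(ch.encode(encoding))


for ch in ["A", "中", "😀"]:
    print(ch, encoded_len(ch, "utf-8"), encoded_len(ch, "utf-16-le"))
# A 1 2
# 中 3 2
# 😀 4 4
```

So UTF-16 is also variable-length, just with 2-byte units instead of 1-byte units.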

Typing Chinese at work is really a pain...

Here is a URL for more detail:

http://stackoverflow.com/questions/3951722/whats-the-difference-between-unicode-and-utf8

== quoted ==

If asked the question, "What is the difference between UTF-8 and Unicode?", would you confidently reply with a short and precise answer? In these days of internationalization all developers should be able to do that. I suspect many of us do not understand these concepts as well as we should. If you feel you belong to this group, you should read this ultra short introduction to character sets and encodings.

Actually, comparing UTF-8 and Unicode is like comparing apples and oranges:

UTF-8 is an encoding - Unicode is a character set

A character set is a list of characters with unique numbers (these numbers are sometimes referred to as "code points"). For example, in the Unicode character set, the number for A is 41 in hexadecimal (written U+0041, which is 65 in decimal).

An encoding on the other hand, is an algorithm that translates a list of numbers to binary so it can be stored on disk. For example UTF-8 would translate the number sequence 1, 2, 3, 4 like this:

00000001 00000010 00000011 00000100

Our data is now translated into binary and can now be saved to disk.
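The same translation can be reproduced in Python as a quick check. This is a sketch with variable names chosen for this note: `chr` turns each number into the character with that code point, and `encode` applies UTF-8.

```python
numbers = [1, 2, 3, 4]

# Build a string whose characters have exactly these code points.
text = "".join(chr(n) for n in numbers)

# Apply the UTF-8 encoding to get the bytes that would go to disk.
encoded = text.encode("utf-8")

# Show each byte as an 8-bit group.
bits = " ".join(f"{b:08b}" for b in encoded)
print(bits)
# 00000001 00000010 00000011 00000100
```

Numbers below 128 each fit in a single UTF-8 byte, which is why the output has exactly four groups.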

All together now

Say an application reads the following from the disk:

01101000 01100101 01101100 01101100 01101111

The app knows this data represents a Unicode string encoded with UTF-8 and must show it as text to the user. The first step is to convert the binary data to numbers. The app uses the UTF-8 algorithm to decode the data. In this case, the decoder returns this:

104 101 108 108 111

Since the app knows this is a Unicode string, it can assume each number represents a character. We use the Unicode character set to translate each number to a corresponding character. The resulting string is "hello".
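The two steps can be sketched in Python, taking the five bytes from the example written as 8-bit groups. The variable names are chosen for illustration only.

```python
bits = "01101000 01100101 01101100 01101100 01101111"

# Turn each group of bits back into one raw byte, as if read from disk.
raw = bytes(int(group, 2) for group in bits.split())

# Step 1: UTF-8 decoding turns the bytes into code points.
# Python's decode() goes straight to a string, so recover the numbers with ord().
text = raw.decode("utf-8")
numbers = [ord(ch) for ch in text]
print(numbers)
# [104, 101, 108, 108, 111]

# Step 2: the character set maps each number to a character.
print(text)
# hello
```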

Conclusion

So when somebody asks you "What is the difference between UTF-8 and Unicode?", you can now confidently give a short and precise answer:

UTF-8 and Unicode cannot be compared. UTF-8 is an encoding used to translate between numbers and binary data. Unicode is a character set used to translate numbers into characters.

== unquoted ==
